feat: Add Amazon S3 Vectors document store integration #3149

dotKokott wants to merge 9 commits into deepset-ai:main from
Conversation
Implements issue deepset-ai#2110 - Amazon S3 Vectors document store integration with:

- S3VectorsDocumentStore: full DocumentStore protocol (count, write, filter, delete)
- S3VectorsEmbeddingRetriever: embedding-based retrieval with metadata filtering
- Filter conversion from Haystack format to S3 Vectors filter syntax
- Auto-creation of vector buckets and indexes
- AWS credential support via Secret (or default credential chain)
- 49 unit tests covering store, retriever, filters, and serialization
- README with usage examples and known limitations
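The filter conversion mentioned above can be sketched roughly as follows. This is an illustrative reimplementation, not the PR's actual code: the helper name `convert_filters`, the operator map, and the assumption that S3 Vectors accepts MongoDB-style operators (`$eq`, `$and`, ...) are all mine.

```python
# Illustrative sketch of converting Haystack comparison/logic filters to an
# S3 Vectors-style metadata filter. Operator names ($eq, $gt, $and, ...)
# are an assumption about the target syntax, not verified API details.

_OPS = {"==": "$eq", "!=": "$ne", ">": "$gt", ">=": "$gte",
        "<": "$lt", "<=": "$lte", "in": "$in"}

def convert_filters(haystack_filter: dict) -> dict:
    """Recursively map a Haystack filter dict to the assumed target syntax."""
    if "conditions" in haystack_filter:  # logical node: AND / OR
        op = "$and" if haystack_filter["operator"] == "AND" else "$or"
        return {op: [convert_filters(c) for c in haystack_filter["conditions"]]}
    # comparison node: strip the "meta." prefix Haystack uses for metadata fields
    field = haystack_filter["field"].removeprefix("meta.")
    return {field: {_OPS[haystack_filter["operator"]]: haystack_filter["value"]}}

print(convert_filters({
    "operator": "AND",
    "conditions": [
        {"field": "meta.type", "operator": "==", "value": "article"},
        {"field": "meta.year", "operator": ">=", "value": 2020},
    ],
}))
```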
…rkflow

- boto3 lower bound set to 1.42.0 (when the s3vectors service was added)
- pydoc filename changed to amazon_s3_vectors.md (underscores, matching the folder name)
- Quote $GITHUB_OUTPUT in the workflow to fix shellcheck SC2086
- Flatten test classes into standalone functions (matching the pinecone/qdrant pattern)
- Assert the full serialized dict structure in to_dict/from_dict tests
- Use Mock(spec=...) for retriever tests instead of MagicMock+patch
- Verify _embedding_retrieval call args match exactly
- Add test_from_dict_no_filter_policy (backward compat)
- Add test_init_is_lazy
Remove tests that just verify mock plumbing (count, write, delete calling the mock client). Keep tests that verify our actual logic:

- Serialization roundtrip (full dict structure)
- Score conversion (cosine + euclidean)
- Filter conversion (pure function with real logic)
- Duplicate policy batch checks (SKIP/NONE)
- Document <-> S3 vector conversion
- Input validation

Before: 49 unit tests (many testing mock behavior). After: 26 unit tests (all testing our code) + 12 integration tests.
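The score conversion listed above can be sketched as below. The cosine mapping (`1 - distance`) comes from the PR description; the euclidean mapping shown is one common squashing choice and only an assumption about what the store actually does.

```python
# Sketch of distance-to-score conversion for Haystack's "higher is better"
# convention. Cosine mapping (1 - distance) is stated in the PR; the
# euclidean mapping here is an assumed example, not the verified formula.

def distance_to_score(distance: float, metric: str) -> float:
    if metric == "cosine":
        # cosine distance in [0, 2]; smaller distance -> higher score
        return 1.0 - distance
    if metric == "euclidean":
        # squash unbounded euclidean distance into (0, 1]
        return 1.0 / (1.0 + distance)
    raise ValueError(f"Unsupported metric: {metric}")

print(distance_to_score(0.25, "cosine"))    # 0.75
print(distance_to_score(0.0, "euclidean"))  # 1.0
```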
- Class docstring: top_k cap, dimension limit, metadata limits, float32 only - write_documents: embedding required, 40KB metadata limit - _embedding_retrieval: top_k=100 cap, no embeddings in response - Retriever run: top_k=100, server-side filters, no embeddings returned
…ity, deduplicate retrieval logic

- Replace hand-rolled _apply_filters_in_memory/_document_matches/_compare with haystack.utils.filters.document_matches_filter (the same utility used by InMemoryDocumentStore). Gains the NOT operator, nested dotted field paths, and date comparison support for free. (-65 lines)
- Deduplicate blob/content reconstruction in _embedding_retrieval() by reusing _s3_vector_to_document() + dataclasses.replace() (-20 lines)
- Make the filter_documents() warning conditional on filters actually being provided (no warning when listing all documents)
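For illustration, the gist of the client-side matching now delegated to `haystack.utils.filters.document_matches_filter` looks roughly like this, recreated on plain dicts rather than `Document` objects; the `NOT` semantics shown ("none of the conditions match") are an assumption, not a statement about the real utility.

```python
# Rough recreation of client-side filter matching over plain metadata dicts.
# The real implementation lives in haystack.utils.filters and operates on
# Document objects; this sketch only conveys the recursive structure.

import operator

_CMP = {"==": operator.eq, "!=": operator.ne, ">": operator.gt,
        ">=": operator.ge, "<": operator.lt, "<=": operator.le,
        "in": lambda a, b: a in b}

def matches(filters: dict, meta: dict) -> bool:
    if "conditions" in filters:  # logical node
        results = [matches(c, meta) for c in filters["conditions"]]
        if filters["operator"] == "AND":
            return all(results)
        if filters["operator"] == "OR":
            return any(results)
        return not any(results)  # NOT, assumed "none match" semantics
    field = filters["field"].removeprefix("meta.")
    return _CMP[filters["operator"]](meta.get(field), filters["value"])

docs = [{"type": "article", "year": 2021}, {"type": "blog", "year": 2019}]
flt = {"operator": "AND", "conditions": [
    {"field": "meta.type", "operator": "==", "value": "article"},
    {"field": "meta.year", "operator": ">=", "value": 2020}]}
print([d for d in docs if matches(flt, d)])
```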
CI: Integration tests need AWS credential setup

The integration tests currently run unconditionally in CI with no AWS credentials configured. The tests have a
What needs to happen

The workflow should match the

```yaml
# Do not authenticate on PRs from forks and on PRs created by dependabot
- name: AWS authentication
  id: aws-auth
  if: github.event_name == 'schedule' || (github.event.pull_request.head.repo.full_name == github.repository && !startsWith(github.event.pull_request.head.ref, 'dependabot/'))
  uses: aws-actions/configure-aws-credentials@ec61189d14ec14c8efccab744f656cffd0e33f37
  with:
    aws-region: us-east-1
    role-to-assume: ${{ secrets.AWS_S3_VECTORS_CI_ROLE_ARN }}

- name: Run integration tests
  if: success() && steps.aws-auth.outcome == 'success'
  run: hatch run test:integration-cov-append-retry
```

Prerequisites (maintainer action required)
@dotKokott I'll try to take a look in the next few days. Have you tried the integration yourself in a real-world setting with AWS?
I have tried all integration tests and examples on my AWS account. However, I did not try any large datasets. That might be the next thing to validate: does this work as expected under real load?
anakin87
left a comment
I left some initial comments.
Will take a better look soon.
```diff
@@ -0,0 +1,207 @@
# amazon-s3-vectors-haystack
```
We want to have a very minimal README (see https://github.com/deepset-ai/haystack-core-integrations/blob/main/integrations/amazon_bedrock/README.md). This info is useful, but we'll put it in the docs.
Sounds good. Should I leave it in the README until we know where you will put the info? Or would you like me to minimize the README already?
Yes, let's minimize it, but please save this info somewhere to be re-used in the docs.
Matches the pattern used by the amazon_bedrock workflow:

- top-level id-token: write permission
- AWS_REGION env var
- configure-aws-credentials step (skipped on fork PRs and dependabot)
- integration tests gated on successful auth
Matches the repo convention used across other integrations.
```python
for doc in batch:
    if doc.embedding is None:
        msg = f"Document '{doc.id}' has no embedding. S3VectorsDocumentStore requires embeddings."
        raise DocumentStoreError(msg)
```
I would do this check not per batch but up front for all docs, to raise an error before any writing starts.
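A minimal sketch of that suggestion: validate every document's embedding before the first batch is written, so a failure cannot leave a partially written set. `Document` and `DocumentStoreError` mirror Haystack's names but are stubbed locally here to keep the example self-contained; `validate_embeddings` is a hypothetical helper.

```python
# Up-front validation sketch: collect all offending ids first, then raise
# before any write happens. Stub types stand in for Haystack's own.

from dataclasses import dataclass
from typing import Optional

class DocumentStoreError(Exception):
    pass

@dataclass
class Document:
    id: str
    embedding: Optional[list] = None

def validate_embeddings(documents: list) -> None:
    missing = [doc.id for doc in documents if doc.embedding is None]
    if missing:
        msg = f"Documents {missing} have no embedding. S3VectorsDocumentStore requires embeddings."
        raise DocumentStoreError(msg)

docs = [Document("a", [0.1]), Document("b", None)]
try:
    validate_embeddings(docs)  # raises before any batch is sent
except DocumentStoreError as e:
    print(e)
```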
```python
from haystack_integrations.components.retrievers.amazon_s3_vectors import S3VectorsEmbeddingRetriever
from haystack_integrations.document_stores.amazon_s3_vectors import S3VectorsDocumentStore
```
For integration tests of Document Stores, we should inherit from Haystack's DocumentStoreBaseExtendedTests, which already contains the necessary tests
- Implementation tips
- see pgvector for an example:
Related Issues
Proposed Changes:
Adds an Amazon S3 Vectors document store integration — a serverless vector storage capability native to S3.
Components:
- `S3VectorsDocumentStore`: full DocumentStore protocol (write, count, filter, delete)
- `S3VectorsEmbeddingRetriever`: embedding-based retrieval with server-side metadata filtering

Key design decisions:
- Distances converted to scores (`1 - distance`) for the Haystack convention
- `filter_documents()` uses `list_vectors(returnData=True, returnMetadata=True)` with client-side filtering (warning logged), since S3 Vectors has no standalone filter API
- `DuplicatePolicy.SKIP`/`NONE` (batches of 100)

Known limitations (documented in README):
- `top_k` capped at 100 (service limit)
- `query_vectors` does not return embedding data
- `float32` only, `cosine`/`euclidean` metrics, eventual consistency

How did you test it?
- `pytest` mark / credential guard for CI
- `hatch run test:all`, `hatch run fmt`, `hatch run test:types`
- Example (`examples/example.py`) verified against live AWS

Notes for the reviewer
This PR was fully generated with an AI assistant. I have reviewed the changes and run the relevant tests.
Structure and test style follow the Pinecone integration pattern.
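The duplicate-policy batching described in the design decisions can be sketched as below. The batch size of 100 and the `SKIP`/`NONE` policy names come from the PR; the check itself is an illustrative reimplementation (raising on `NONE` is my assumption about the store's behavior), not the actual store code.

```python
# Sketch of writing ids in batches of 100 with duplicate handling:
# SKIP drops ids that already exist; NONE (assumed here) raises on any
# duplicate. Returns how many ids were actually written.

BATCH_SIZE = 100

def write_in_batches(ids: list, existing: set, policy: str) -> int:
    written = 0
    for i in range(0, len(ids), BATCH_SIZE):
        batch = ids[i : i + BATCH_SIZE]
        dupes = [d for d in batch if d in existing]
        if policy == "NONE" and dupes:
            raise ValueError(f"Duplicate ids: {dupes}")
        if policy == "SKIP":
            batch = [d for d in batch if d not in existing]
        existing.update(batch)
        written += len(batch)
    return written

print(write_in_batches(["a", "b", "c"], existing={"b"}, policy="SKIP"))  # 2
```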
Checklist

- The PR title follows the convention: `fix:`, `feat:`, `build:`, `chore:`, `ci:`, `docs:`, `style:`, `refactor:`, `perf:`, `test:`.